Understanding Dataset and Objective

The wine industry is lucrative and growing as social drinking is on the rise. Many factors make the taste and quality of a wine unique. They include, but are not limited to, the following:

acidity

pH level

sugar remaining in wine

chlorides

density

In this project we use a dataset of wines containing 4898 observations of white wines produced in Portugal. Different properties of each wine are tested and recorded in this dataset. Also, each variety of wine is tasted by three independent tasters, and the final rank assigned is the median rank given by the tasters.

In this project, I try to understand this dataset better and also try to find out if there is a relationship between quality of wine and its different properties.

# Read the csv file, as well as summary of the data.
getwd() #check the route
## [1] "/Users/KunWuYao/Desktop/Python shortcut/Python Exercise ver3/Udacity/Data Analyst Nanodegree/Project4"
wwd <- read.csv('wineQualityWhites.csv') #read the csv file
str(wwd) #check the structure
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
summary(wwd) #show the summary
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
#global options
#For future reference, note that you can use global options (instead of specifying in each chunk) to 
#suppress code, warnings, and messages output by R so that it doesn't appear in your knit HTML 
#file with the following:

# {r global_options, include=FALSE}
# knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)

There are 4898 observations of 13 variables: a row index (X), 11 chemical input variables, and one output variable, wine quality.

Below is a brief description of each input variable (based on physicochemical tests):

Chemical properties:

fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily) (tartaric acid - g / dm^3)

volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste (acetic acid - g / dm^3)

citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines (g / dm^3)

residual sugar: the amount of sugar remaining after fermentation stops (g / dm^3)

chlorides: the amount of salt in the wine (sodium chloride - g / dm^3)

free sulfur dioxide: the free form of SO2 that exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion (mg / dm^3)

total sulfur dioxide: amount of free and bound forms of SO2 (mg / dm^3)

density: the density of wine is close to that of water depending on the percentage of alcohol and sugar content (g / cm^3)

pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic)

sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels (potassium sulphate - g / dm^3)

alcohol: the percentage of alcohol content in the wine (% by volume)

Output variable (based on sensory data):

quality (score between 0 and 10)

The summary output above shows how each variable is distributed. The middle 50% of wines have a fixed acidity between 6.3 and 7.3 g / dm^3. As for sugar, 75% of the wines in our dataset have less than 9.9 g / dm^3 of sugar remaining after fermentation stops. The average alcohol percentage is about 10.51, and the average density is about 0.9940 g / cm^3.

library(ggplot2)   # ggplot(), geom_boxplot(), geom_histogram(), ggsave()
library(gridExtra) # grid.arrange()

#First check the quality distribution of the white wines
#Draw a boxplot and a histogram to see the distribution
b_g_q <- ggplot(aes(x = 'Quality', y = quality), data = wwd) +
  geom_boxplot() 
#  geom_jitter(alpha = 0.5, size = 0.5, color = 'pink')
b_b_q <- ggplot(aes(x = quality), data = wwd) +
  geom_histogram(binwidth = 1) # one bar per integer quality score
Q_box <- grid.arrange(b_g_q, b_b_q, nrow = 1)

ggsave('Q_box.jpg', Q_box)

As we can see from the boxplot and histogram, most white wines are rated from 5 to 7, some are rated 4, very few are rated 3, and some excellent wines are rated 8 or above.

Next I would like to check the acidity.

Based on the bottom-right figure, wines are acidic: their pH ranges from 2.72 to 3.82 according to the summary of our dataset; however, most wines have a pH between 3 and 3.5.

The acidic nature of wines can come from three different types of acids:

Fixed acidity, which in most cases is between 6 and 8 g / dm^3.

Volatile acidity, which is mostly in the range of 0.1 to 0.5 g / dm^3.

Citric acid, which ranges from 0 to 1.66 g / dm^3 but for most wines in our dataset is between 0.2 and 0.5 g / dm^3.

These features all seem to follow a normal distribution except Volatile Acidity which is slightly right skewed.

I will do some transformations and compare which one of the results would be more bell-shaped:

It seems that square root (volatile acidity) follows normal distribution (at least it is more bell-shaped than before and the other two adjustments); therefore we will use the square root transformation for our further analysis.
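The visual comparison of transformations can be backed by a number. The following is a minimal sketch, assuming a simulated right-skewed variable in place of volatile.acidity; the `skewness` helper and the simulated data are illustrative and not part of the original analysis. On the real data the ranking of transformations may differ from what this toy example gives.

```r
# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
# Simulated right-skewed stand-in for wwd$volatile.acidity (assumption).
x <- rlnorm(5000, meanlog = log(0.26), sdlog = 0.35)

# Skewness near 0 means the transformed values are closer to symmetric.
skew_by_transform <- c(
  raw       = skewness(x),
  log10     = skewness(log10(x)),
  sqrt      = skewness(sqrt(x)),
  cube_root = skewness(x^(1/3))
)
round(skew_by_transform, 3)
```

A value closer to zero indicates a more symmetric shape, so the same vector could be computed for each candidate transformation of a real column and compared alongside the histograms.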

Second: Take a look at another 4 variables

Based on the above figures, the amount of sugar remaining after fermentation is rarely more than 20 g/dm^3.

The chlorides range in wines in our dataset is usually between 0 and 0.1 with some exceptions of more than 0.1 g/dm^3.

The density of wine is typically very slightly lower than that of water. The typical range is 0.99 to 1 g / cm^3.

Alcohol percentage in wine varies between 8 and 14; however, for most of the wines it is between 9 and 13.

Besides, the distributions of residual sugar and alcohol are right-skewed.

Now that we can see the distribution more clearly, next I would like to try transforming the data to make it more bell-shaped.

Now chlorides and density are closer to bell-shaped. However, alcohol is still a little right-skewed even after the log transformation. Residual sugar is far from normally distributed, as shown above; its distribution looks bimodal, like two separate bells.

Finally, the other 3 variables: sulphates, free sulfur dioxide, and total sulfur dioxide

From the graph we can see that, except for some outliers, free sulfur dioxide and total sulfur dioxide look bell-shaped, and sulphates is a little right-skewed.

After transformation, sulphates is more bell-shaped.

After having a basic sense about the variables, I proceed to check the correlation between the input variables in our dataset:

Some observations are listed below:

Strong positive relationship between density and sugar remaining (0.839)

Positive relationship between total SO2 and free SO2 (0.616)

Negative relationship between alcohol and density (-0.78)

Features in our data seem to follow a normal distribution

To avoid multicollinearity in model building using regressions, we have to be aware of strong correlations among input variables.
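One way to keep track of strongly correlated inputs is to list them programmatically. The sketch below is a rough illustration on simulated data: the `toy` data frame, its column names, and the 0.7 threshold are all assumptions chosen for the example, not values from the original analysis.

```r
set.seed(1)
n <- 500
# Simulated stand-ins (assumption): density is built to depend on sugar and alcohol.
sugar   <- rexp(n, rate = 1 / 6)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
ph      <- rnorm(n, mean = 3.2, sd = 0.15)
density <- 0.994 + 0.0004 * sugar - 0.001 * alcohol + rnorm(n, sd = 0.0003)
toy <- data.frame(sugar, alcohol, ph, density)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the chosen threshold.
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.7, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
strong_pairs
```

Pairs that show up in such a list are candidates for dropping or combining before fitting a regression, since including both adds little information and can destabilize the coefficients.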

Knowing that some variables are related to others, I want to pick those out and split them by quality score to see whether the correlations differ between quality levels:

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"

Taking a deeper look at the dataset, I find that some correlations vary a lot between quality levels. For example, the correlation between alcohol and free sulfur dioxide is 0.295 for quality 9, but it is negative for the other quality levels. We should bear in mind that the relationships between variables might vary across quality scores.

We use Spearman’s rho statistic to estimate a rank-based measure of association. The correlation falls between -1 and 1. A value of 0 suggests there is no association between the two variables, while values close to -1 or 1 suggest strong negative and positive associations, respectively.

##                             [,1]
## fixed.acidity        -0.08448545
## volatile.acidity     -0.19656168
## citric.acid           0.01833273
## residual.sugar       -0.08206979
## chlorides            -0.31448848
## free.sulfur.dioxide   0.02371338
## total.sulfur.dioxide -0.19668029
## density              -0.34835102
## pH                    0.10936208
## sulphates             0.03331897
## alcohol               0.44036918

As we can see, only alcohol, density, and chlorides show even a weak correlation with quality; the others do not seem to correlate with the quality of wine. However, to discover more about how the input variables relate to quality, and so to predict wine quality, I would like to dig deeper into the dataset.
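To make the rank-based nature of Spearman's rho concrete, here is a small sketch on simulated data (the vectors `x` and `y` are illustrative assumptions). It shows that `cor(..., method = "spearman")` matches the ordinary Pearson correlation computed on the ranks of the values.

```r
set.seed(7)
x <- rexp(200)                    # skewed, illustrative data (assumption)
y <- x^2 + rnorm(200, sd = 0.5)   # non-linear but mostly increasing in x

rho_builtin <- cor(x, y, method = "spearman")
rho_by_hand <- cor(rank(x), rank(y))   # Pearson correlation of the ranks

all.equal(rho_builtin, rho_by_hand)
```

Because only the ordering of values matters, Spearman's rho is less sensitive to skew and outliers than Pearson's r, which suits variables such as residual sugar and chlorides.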

It is difficult to find a specific pattern in this figure since quality has a wide range. To simplify the situation, I categorize the quality of wine into four groups of Poor, Normal, Good, and Great to be able to differentiate patterns in each category.

Below is how the quality of wines is distributed based on the rating that I just introduced:

table(wwd$rating)
## 
##   Poor Normal   Good  Great 
##    183   1457   2198   1060
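The code that creates the rating is not shown, so the following is only a sketch of how such a grouping could be built with `cut()`. The cut points are an assumption, chosen so that Good and Great together cover scores of 6 and above, as later sections state; the `quality` vector is toy data.

```r
# Assumed cut points: Poor = 4 or below, Normal = 5, Good = 6, Great = 7 or above.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)   # toy scores, not the real data

rating <- cut(quality,
              breaks = c(-Inf, 4, 5, 6, Inf),   # right-closed intervals by default
              labels = c("Poor", "Normal", "Good", "Great"),
              ordered_result = TRUE)
table(rating)
```

Storing the result as an ordered factor keeps the categories in a sensible order in tables and plot legends.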

Now again we plot the two features of pH and density but this time use the new rating to see a pattern between quality and these two features:

From the scatter plot we can see that most good and great wines have lower density, whereas pH shows no clear correlation with wine quality. Next I would like to dig deeper into the relationship between density and wine rating.

After splitting density into 3 groups at the 1st and 3rd quartiles, we can easily see that most wines with low density are rated as “Good” or “Great”, but for poor and normal wines it is still difficult to find a connection with density.
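A quartile-based grouping like this could be sketched as follows. The simulated `density` values are an assumption, and so are the labels "Normal" and "High"; only the label 'Low' appears in the later filtering code.

```r
set.seed(3)
density <- rnorm(1000, mean = 0.994, sd = 0.003)   # toy values (assumption)

# 1st and 3rd quartiles serve as the two inner break points.
q <- unname(quantile(density, probs = c(0.25, 0.75)))

density_label <- cut(density,
                     breaks = c(-Inf, q[1], q[2], Inf),
                     labels = c("Low", "Normal", "High"))
table(density_label)
```

With `cut()`'s default right-closed intervals, a value exactly equal to the 1st quartile falls into the "Low" group, which matches a filter of the form "density below or equal to the 1st quartile".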

From the distribution we can see that alcohol by volume shows no clear correlation with poor quality, and citric acid content shows no clear connection with wine quality either; the points are spread widely all over the plot. To make the relationships clearer, let’s take a deeper look.

Percentage of getting good/great wines with alcohol percentage >= 14%:
(6+1)/(6+1+0+0) = 100%
Percentage of getting good/great wines with alcohol percentage between 13% and 14%:
(84+41)/(84+41+4+1) = 96.15%
Percentage of getting good/great wines with alcohol percentage between 12% and 13%:
(328+302)/(328+302+31+10) = 93.89%
Percentage of getting good/great wines with alcohol percentage between 11% and 12%:
(298+453)/(298+453+120+32) = 83.17%
Percentage of getting good/great wines with alcohol percentage >= 13%:
(6+1+84+41)/(6+1+0+0+84+41+4+1) = 96.35%
Percentage of getting good/great wines with alcohol percentage >= 12%:
(6+1+84+41+328+302)/(6+1+0+0+84+41+4+1+328+302+31+10) = 94.31%
Percentage of getting good/great wines with alcohol percentage >=11%:
(6+1+84+41+328+302+298+453)/(6+1+0+0+84+41+4+1+328+302+31+10+298+453+120+32) = 88.43%
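The percentages above can be reproduced from the counts in a few lines. This sketch only re-does the arithmetic: the counts are copied from the list above, while the vector names and band labels are illustrative.

```r
# Counts copied from the list above, ordered from the highest alcohol band down.
bands       <- c(">= 14", "13 to 14", "12 to 13", "11 to 12")
good_great  <- c(6 + 1, 84 + 41, 328 + 302, 298 + 453)
normal_poor <- c(0 + 0, 4 + 1, 31 + 10, 120 + 32)
total       <- good_great + normal_poor

# Share of good/great wines within each band.
per_band <- round(100 * good_great / total, 2)

# Share of good/great wines at or above each band's lower bound.
cumulative <- round(100 * cumsum(good_great) / cumsum(total), 2)

data.frame(bands, per_band, cumulative)
```

The cumulative column uses running sums, so each row answers "what if I accept every wine with at least this much alcohol?".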

The boxplot shows that most wines with an alcohol percentage of 11% or above are rated as “Good” or “Great”. The easiest way to pick a good or great wine is to choose one with an alcohol percentage of 14% or above, though that leaves very few choices. Now I want to combine the two findings above (density and alcohol) and see whether a distribution pattern emerges.

The plot looks pretty good now, but I want to separate high and low alcohol percentages within normal density to make it clearer.

Percentage of getting good/great wines with alcohol percentage >= 14%:
(6+1)/(6+1+0+0) = 100%
Percentage of getting good/great wines with alcohol percentage between 13% and 14%:
(78+39)/(78+39+4+0) = 96.69%
Percentage of getting good/great wines with alcohol percentage between 12% and 13%:
(260+243)/(260+243+20+10) = 94.37%
Percentage of getting good/great wines with alcohol percentage between 11% and 12%:
(161+225)/(161+225+57+18) = 83.73%
Percentage of getting good/great wines with alcohol percentage >= 13%:
(6+1+78+39)/(6+1+0+0+78+39+4+0) = 96.88%
Percentage of getting good/great wines with alcohol percentage >= 12%:
(6+1+78+39+260+243)/(6+1+0+0+78+39+4+0+260+243+20+10) = 94.86%
Percentage of getting good/great wines with alcohol percentage >=11%:
(6+1+78+39+260+243+161+225)/(6+1+0+0+78+39+4+0+260+243+20+10+161+225+57+18) = 90.29%

Now, according to the plot, we know that we have a great chance of getting a good or great wine if we choose one with a higher alcohol percentage and lower density. That is great progress!

Again, poor quality wines are spread widely all over the plot, and great quality wines seem to contain less sugar. Let’s dig deeper into it.

Although many good and great wines contain less sugar, many normal and poor wines have low sugar content as well, so there is no clear correlation between rating and residual sugar. But surprisingly, among wines with 13% alcohol and above, we can raise the likelihood of choosing a good or great wine to 100% by removing the wines with low sugar content, which include all of the normal and poor ones.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.225   1.350   1.300   1.425   1.500

Now we can pick any one of the 114 bottles and it will always be a good or great wine, wow!

## 
##     (0,0.01]  (0.01,0.02]  (0.02,0.03] (0.03,0.035] (0.035,0.04] 
##            1           55          508          633          866 
##  (0.04,0.05]   (0.05,0.1]   (0.1,0.15] (0.15,0.347] 
##         1672         1053           51           59

Chance to get good/great wines when chloride volume <= 0.035:
(229+287+213+221+27+20)/(229+287+94+23+213+221+65+9+27+20+6+2+1) = 83.29%
Chance to get good/great wines when chloride volume <= 0.030:
(213+221+27+20)/(213+221+65+9+27+20+6+2+1) = 85.28%

From the plot we can see that most wines with chlorides below or equal to 0.035 g / dm^3 are rated as “Good” or “Great”, but again, there is no clear correlation between poor wines and chloride content.

Percentage of getting good/great wines with alcohol percentage >= 14%:
(2+1)/(2+1+0+0) = 100%
Percentage of getting good/great wines with alcohol percentage between 13% and 14%:
(50+24)/(50+24+4+0) = 94.87%
Percentage of getting good/great wines with alcohol percentage between 12% and 13%:
(171+140)/(171+140+9+8) = 94.82%
Percentage of getting good/great wines with alcohol percentage between 11% and 12%:
(91+87)/(91+87+17+8) = 87.68%
Percentage of getting good/great wines with alcohol percentage >= 13%:
(2+1+50+24)/(2+1+0+0+50+24+4+0) = 95.06%
Percentage of getting good/great wines with alcohol percentage >= 12%:
(2+1+50+24+171+140)/(2+1+0+0+50+24+4+0+171+140+9+8) = 94.87%
Percentage of getting good/great wines with alcohol percentage >=11%:
(2+1+50+24+171+140+91+87)/(2+1+0+0+50+24+4+0+171+140+9+8+91+87+17+8) = 92.48%

Now we know how to pick good and great wines, but we still don’t know what makes the wines poor.

wwd_poor <- subset(wwd, rating == 'Poor' | rating == 'Normal') # keep only wines rated Poor or Normal
wwd_poor$rating_score[wwd_poor$rating == 'Poor']  <- 0 # numeric score: Poor = 0
wwd_poor$rating_score[wwd_poor$rating == 'Normal'] <- 1 # numeric score: Normal = 1
cor(wwd_poor[, 2:12], wwd_poor$rating_score, method = 'spearman')
##                             [,1]
## fixed.acidity        -0.04987398
## volatile.acidity     -0.13164279
## citric.acid           0.05997837
## residual.sugar        0.15287359
## chlorides             0.02732656
## free.sulfur.dioxide   0.22755880
## total.sulfur.dioxide  0.13874452
## density               0.11007739
## pH                   -0.01441605
## sulphates             0.02849757
## alcohol              -0.11866946
# Try to see what makes poor wines different from the normal ones

It’s pretty hard to find out what leads to a “Poor” rating, so let’s try making boxplots of each input variable by rating and see if we can find some clues:

From the boxplots we see that free sulfur dioxide (column 7 of the data frame) is more related to poor wines than the other variables. Next I would like to create a scatter plot to see if free sulfur dioxide and residual sugar have an impact on the quality of wine.

wwd_poor <- subset(wwd, rating == 'Poor') # create a new dataset with only poor wines
table(wwd_poor$free.sulfur.dioxide)
## 
##     3     4     5     6     7     8     9    10    11  11.5    12    13 
##     3     4    12     9     6    10     7     6     4     1     1     6 
##    14    15    16    17    18    19    20    21    22    23    24    25 
##     3     6     3     4     7     3     4     2     2     5     4     6 
##    26    27    28    29    30    31    32    33    34    35  35.5    36 
##     2     2     3     4     2     2     3     1     7     2     1     2 
##    37    38    39    40    41    42    46    47    48    50    54    58 
##     3     1     1     1     1     1     2     1     1     3     1     2 
##    60    61    62    63    64    69    70    75 118.5 122.5   124 138.5 
##     1     1     1     2     1     1     2     1     1     1     1     1 
## 146.5   289 
##     1     1
# total poor wines at or below 10 is 57
table(wwd$free.sulfur.dioxide)
## 
##     2     3     4     5     6     7     8     9    10    11  11.5    12 
##     1    10    11    25    32    25    35    29    55    45     1    51 
##    13    14    15  15.5    16    17    18    19  19.5    20    21    22 
##    55    68    79     1    58    89    80    84     1   101    93   102 
##    23  23.5    24    25    26    27    28  28.5    29    30  30.5    31 
##   110     1   118   111   129    99   112     1   160    99     1   132 
##    32    33    34    35  35.5    36    37    38  38.5    39  39.5    40 
##   109   112   128   129     2   127   111   102     1    89     1   103 
##  40.5    41  41.5    42  42.5    43  43.5    44  44.5    45    46    47 
##     1   104     2    86     1    63     1    75     4   101    64    91 
##    48  48.5    49    50  50.5    51  51.5    52  52.5    53    54    55 
##    66     7    82    64     2    54     1    72     4    68    61    58 
##    56    57    58    59  59.5    60  60.5    61  61.5    62    63    64 
##    42    44    37    39     2    38     2    47     1    29    30    23 
##  64.5    65    66    67    68    69    70  70.5    71    72    73  73.5 
##     1    14    17    22    24    17    11     1     5     6     8     4 
##    74    75    76    77  77.5    78    79  79.5    80    81    82  82.5 
##     5     7     5     5     1     4     2     4     1     7     2     1 
##    83    85    86    87    88    89    93    95    96    97    98   101 
##     4     2     2     4     1     1     1     1     3     1     3     2 
##   105   108   110   112 118.5 122.5   124   128   131 138.5 146.5   289 
##     2     3     1     1     1     1     1     1     1     1     1     1
# total wines at or below 10 is 223
# the proportion is 25.56%, which is much higher than choosing blindly (3.74%) but not enough

It seems that wines with lower free sulfur dioxide content have a higher chance of being rated “Poor”. Among wines with free sulfur dioxide at or below 10 mg / dm^3, the poor ones account for 57 of the 223 (25.56%), which is much higher than picking blindly (183 out of 4,898, 3.74%). It’s a pretty good finding, but it is still not strong enough to distinguish poor wines from the others. So far, the data offers no clear clue for identifying poor quality wines.

The plot shows a connection between poor wine quality and free SO2 content only when filtering for an alcohol percentage of 11 or above. However, normal wine quality seems to be unrelated to free SO2 content. It seems there are no clear clues for telling poor and normal quality wines apart with the data we have.

How likely are we to get a good or great bottle of wine?

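library(dplyr) # provides %>% and filter()

# apply the filters discovered in the analysis above
wwd_gg <- wwd %>%
  filter(alcohol >= 11) %>%
  filter(densityLabel == 'Low') %>% # densityLabel: density grouped at the 1st and 3rd quartiles
  filter(chlorides <= 0.035)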
table(wwd_gg$rating) #show the counts
## 
##   Poor Normal   Good  Great 
##     16     30    252    314

Chance to get a good or great bottle of wine = (252+314)/(16+30+252+314) = 92.48%
Chance to get a great one = 314/(16+30+252+314) = 51.31%

Compared with the original dataset (choosing blindly):

table(wwd$rating) #show the counts
## 
##   Poor Normal   Good  Great 
##    183   1457   2198   1060

Chance to get a good or great bottle of wine = (2198+1060)/(183+1457+2198+1060) = 66.52%
Chance to get a great one = 1060/4898 = 21.64%

The chance of getting a bottle rated “Good” or “Great” with these filters is 92.48%, which is pretty high! Using the filters, we would also have a better than 50% chance (51.31%) of getting a great one. Both are much better than choosing blindly (66.52% and 21.64%, respectively).

Making prediction

Take alcohol into consideration

## # weights:  12 (6 variable)
## initial  value 6790.069781 
## iter  10 value 5163.815595
## final  value 5163.612823 
## converged
##         pred_mglm
##          Poor Normal Good Great
##   Poor      0     62  113     8
##   Normal    0    738  701    18
##   Good      0    523 1444   231
##   Great     0    118  629   313

Accuracy = (0+738+1444+313) / 4898 = 50.94%
Share of actual good / great wines predicted as good or great = (1444+231+629+313)/(2198+1060) = 2617/3258 = 80.33%
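Both figures can be read off the confusion matrix directly. The sketch below re-enters the alcohol-only matrix from the output above (rows are actual ratings, columns are predictions); the object names are illustrative.

```r
lv <- c("Poor", "Normal", "Good", "Great")

# Confusion matrix copied from the output above: rows = actual, columns = predicted.
conf_mat <- matrix(c(0,  62,  113,   8,
                     0, 738,  701,  18,
                     0, 523, 1444, 231,
                     0, 118,  629, 313),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(actual = lv, predicted = lv))

# Overall accuracy: correctly classified wines (the diagonal) over all wines.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of actual Good/Great wines that are predicted as Good or Great.
gg <- c("Good", "Great")
gg_share <- sum(conf_mat[gg, gg]) / sum(conf_mat[gg, ])

round(100 * c(accuracy = accuracy, gg_share = gg_share), 2)
```

The second figure treats Good and Great as a single class, which is why it is so much higher than the four-class accuracy.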

Now add density label into consideration

## # weights:  20 (12 variable)
## initial  value 6790.069781 
## iter  10 value 5334.216513
## iter  20 value 5145.630447
## final  value 5145.622716 
## converged
##         pred_mglm
##          Poor Normal Good Great
##   Poor      0     65  109     9
##   Normal    0    697  741    19
##   Good      0    518 1416   264
##   Great     0    115  620   325

Accuracy = (0+697+1416+325) / 4898 = 49.78%
Share of actual good / great wines predicted as good or great = (1416+264+620+325)/(2198+1060) = 2625/3258 = 80.57%

Adding the density label does not improve the overall accuracy, possibly because of the strong correlation between alcohol and density.

Add chlorides into consideration

## # weights:  16 (9 variable)
## initial  value 6790.069781 
## iter  10 value 5228.246033
## final  value 5156.780043 
## converged
##         pred_mglm
##          Poor Normal Good Great
##   Poor      0     62  111    10
##   Normal    0    738  698    21
##   Good      0    523 1444   231
##   Great     0    118  603   339

Accuracy = (0+738+1444+339) / 4898 = 51.47%
Share of actual good / great wines predicted as good or great = (1444+231+603+339)/(2198+1060) = 2617/3258 = 80.33%

From the prediction output we can see that the accuracy is not very good. All poor wines are overrated, and some of them are even predicted as “Great”, which means these models offer no effective way to identify poor wines.
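The model code is not shown in this report. The format of the fitting log suggests a multinomial logistic regression from the nnet package, so the sketch below assumes `nnet::multinom` and fits it on simulated data; the toy data frame, its cut points, and the variable names are illustrative assumptions, not the original code.

```r
library(nnet)  # recommended package that ships with standard R installations

set.seed(11)
n <- 600
# Toy data (assumption): higher alcohol makes a better rating more likely.
alcohol <- runif(n, min = 8, max = 14)
score   <- alcohol + rnorm(n, sd = 1.5)
rating  <- cut(score, breaks = c(-Inf, 9.5, 11, 12.5, Inf),
               labels = c("Poor", "Normal", "Good", "Great"))
toy <- data.frame(alcohol, rating)

# Multinomial logistic regression of rating on alcohol; trace = FALSE hides the log.
fit  <- multinom(rating ~ alcohol, data = toy, trace = FALSE)
pred <- predict(fit, newdata = toy)   # predicted class for each row

# Confusion matrix (rows = actual, columns = predicted) and overall accuracy.
conf_mat <- table(actual = toy$rating, pred_mglm = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
conf_mat
```

Adding further predictors would only change the formula, for example `rating ~ alcohol + chlorides`. Note that this sketch evaluates on the same rows it was fitted on, so the accuracy is optimistic compared with a held-out test set.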

## # weights:  42 (30 variable)
## initial  value 9531.067910 
## iter  10 value 6224.505038
## iter  20 value 5744.300614
## iter  30 value 5690.028735
## iter  40 value 5688.041338
## iter  50 value 5686.754626
## iter  60 value 5685.829132
## iter  70 value 5685.579996
## iter  80 value 5685.527502
## iter  90 value 5685.238424
## iter 100 value 5684.850792
## final  value 5684.850792 
## stopped after 100 iterations
## 
##   -4   -3   -2   -1    0    1    2    3 
##    1   17  110  922 2429 1188  211   20
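Assuming the table above counts the differences between predicted and actual quality scores (the sign convention is not stated), the share of exact and near-exact predictions can be summarized as follows. The counts are copied from the table; the object names are illustrative.

```r
# Counts copied from the table above, keyed by the prediction difference.
diff_counts <- c("-4" = 1, "-3" = 17, "-2" = 110, "-1" = 922,
                 "0" = 2429, "1" = 1188, "2" = 211, "3" = 20)
d <- as.integer(names(diff_counts))

exact    <- sum(diff_counts[d == 0]) / sum(diff_counts)        # difference of 0
within_1 <- sum(diff_counts[abs(d) <= 1]) / sum(diff_counts)   # difference of at most 1

round(100 * c(exact = exact, within_1 = within_1), 2)
```

Under that assumption, about half of the predictions are exact, and roughly nine in ten fall within one point of the actual score.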

FINAL PLOT1

Description One

I present 4 boxplots to show the relationships between rating and the input variables I found to be correlated in the prior analysis:
For chlorides on the top-left, we can see that great wines tend to have lower chloride content than the other 3 rating categories;
For free SO2 on the top-right, more than half of the poor wines contain less than 20 mg / dm^3 of free SO2, which is the clearest difference between poor wines and the others;
For density, we can see that good and great wines have lower density, especially the wines rated as “Great”; and
For alcohol on the bottom-right, it’s easy to tell that most great wines and many good wines contain a higher alcohol percentage than normal and poor ones.
From the boxplots we can also see that the differences between ratings on each input variable are not large enough to distinguish them easily.

FINAL PLOT2

Description Two

Although I cannot find an exact pattern to predict wine quality properly, we still have a pretty high chance, 92.48%, of getting a bottle of wine rated 6 or above out of 612 options by applying the following filters:
Density below or equal to 0.9917 g / cm^3;
Alcohol percentage equal to or above 11%; and
Chlorides below or equal to 0.035 g / dm^3
Besides, I also found two ways to pick high-quality wines that never fail in this dataset. One is to pick a wine with an alcohol percentage of 14% or above, though that leaves only 7 choices. The other is to pick a wine with 13% alcohol or above and residual sugar greater than 1.5 g / dm^3, which expands the choices to 114.

FINAL PLOT3

Description Three

After fitting a prediction model that takes alcohol, density, and chlorides into consideration, we can predict wine quality between 5 and 7 well, with a prediction difference of at most 1. But for the great wines rated 8 or above and the poor wines rated 4 or below, most predictions differ from the actual quality by 2 or more. As the plot shows, no wine rated at the extremes of 3 or 9 is predicted with a quality difference of less than 2.

Reflections

In this project, I explored a moderately sized dataset on white wines. The dataset was provided in a clean format, without any missing data, and I didn’t need to augment it with any external sources. That said, the data only consisted of various chemical properties, and a quality score. Some of the key attributes in defining a wine’s quality, like region, vintage, grape variety, etc. were absent.

I used correlation values to find relationships between attributes. Where significant skew was present, I took logs, square roots, or cube roots. I found bimodal distribution in a feature, but it appeared to not be related to quality; it is likely that some confounders are causing the bimodal distribution, and this is worth investigating further.

With all the quality levels, the plots started looking messy. To make things more clear, I re-categorized wine quality into four groups. This helped quite a lot. I found interesting trends in the relationships between wine quality, alcohol percentage, density, and chloride volume, as well as a special relationship between sugar volume and high alcohol percentage. Although some trends were discovered, I found it to be pretty hard to plot the relationships properly. I tried scatter plot at first, but it was messy and hard to quantify. After googling for a long long time, I finally figured out some ways to plot them: boxplot and barplot, accompanying some quantification arguments such as varwidth in geom_boxplot, geom_text, and stat_summary, which were not practiced frequently or even not mentioned in video lectures.

I created a predictive model which was not very good at predicting exact scores. Although most differences between prediction output and actual quality are below or equal to 1, wines rated as “Poor” were all overrated. The model only took some input variables into consideration. Perhaps some higher order terms, and interaction terms could help in reducing the margin of error further.

Of course, to finish this project, I googled a lot of analysis reports on this dataset and borrowed some of their ideas. But it still took me a long time to finish because I had to figure out how to make my code work properly on my own. When I received feedback from the 1st review, I learned a lot about how to express relationships properly and tried to make the analysis report more readable, though it was very frustrating and tiring.

Finally, I found that the relationships between quality of wine and the input variables varied a lot, but maybe we can find some patterns between two adjacent levels. For example, what factors make wines rated 7 and 6 different? Or what factors make wines with 11% alcohol and 12% alcohol different? Such comparisons might help interpret and analyze the relationships between variables better.

Reference:
Loop in R reference: https://www.r-bloggers.com/how-to-write-the-first-for-loop-in-r/

ggpairs reference:
http://stackoverflow.com/questions/39709745/decreasing-the-line-thickness-and-corr-font-size-in-ggpairs-plot
https://www.rdocumentation.org/packages/GGally/versions/1.2.0/topics/ggpairs

Adding labels and text into the plot:
http://stackoverflow.com/questions/3695497/show-instead-of-counts-in-charts-of-categorical-variables
https://github.com/tidyverse/ggplot2/issues/1254
http://stackoverflow.com/questions/3483203/create-a-boxplot-in-r-that-labels-a-box-with-the-sample-size-n

Removing legend labels: http://stackoverflow.com/questions/14604435/turning-off-some-legends-in-a-ggplot

Reshape reference: http://stackoverflow.com/questions/20892266/multiple-plots-using-loops-in-r

Reorder reference: http://stackoverflow.com/questions/5620885/how-does-one-reorder-columns-in-a-data-frame

GGally reference: http://ggobi.github.io/ggally/#canonical_correlation_analysis

Other reference: http://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information